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ABSTRACT 

Wo prosont »n olforlthm for tho Itorottvo solution of 
flntto element prohloms on s concurront processor. Tho 
method of conjugate grsdtonti I* used to solve tho system 
of motrlx equations, which It distributed omong tho 
processors of s MIMO computer according to tn 
olomont-basod spatial decomposition. This algorithm la 
Implomontsd in a two-dimensional olaatoatatlca program 
on tho Caltech Hyporcube concurrent processor. Tho 
rooulta of toots on up to 32 processors show nearly linear 
concurront speodup, with efficiencies over 901 for 
sufficiently largo problems. 

IHTKOOUCTIOM 

Flnlts fitments and Elliptic Boundary Value Problems 

The need to solve boundary value problems Involving 
elliptic partial differential equations on geometrically 
complex domains arises In many engineering contexts, 
and Increasingly In a wide variety of scientific fields. Tho 
finite olomont method, which hat boon trotted In a largo 
number of texts (o.g., Zlonkowici.l), provides t 
flexible and powerful numerical technique for the solution 
of such problems. Tho physical problems which have 
boon solved by finite element methods coma from auch 
diverse fields as structural and continuum mechanics, 
fluid dynamics, hydrology, heat flow analysis and many 
others. 
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Among tho common features of most finite olomont 
applications la tho need to form and to solve a matrix 
equation of tho form: 

A 1 » k (I) 

Hero, I It a vector of unknown quantities, to b* 
solved for at each of some number of 'node* locations 
within tho problem domain. Tho grid of nodes (usually 
representing a discretisation of some physical spatial 
domain) ala* doflnoa a subdivision of tho domain into 
volumes (or areas) called •elements’ which Jointly 
co iprlso the entire problem domain and share nodot 
along their boundaries (Fig. I ). 


Fi*.l Flail* eloBtot discretiniiaa of ebauadtry value 
p roble m 

Tho stiffness matrix A contains tho terms which 
determine tho Interaction among tho different unknown 
degrees of fraodom tn I, end the forcing terms or 
boundary vtluot are Introduced through tho right hand 
side, k. Operationally, entries of A sro computed by 
performing Integrals over tho elements which Inciudo tho 
node in question. The stiffness matrix therefore consists 
of an assembly of Individual element matrices, which are 
mapped Into tho ’global’ matrix through a maatar 
equation bookkeeping scheme. 




rural -»*>!«•' 



ORIGINAL Pi. tel 1 
OR POOR QUALITY 


Common (o moot ouch flnlto olomont applications It 
the problem or solving tho tyolom given in oquttlon (I). 
Since iho matrix A mty bo vory lorgo (especially In 
higher-dimensional problems), consioorsolt (bought 
mutt go Into tolvlng It ifflclontly. Tbit 
computation-intensive slop It tho prlmtry mottvotlon (or 
looking to concurrent completion tochnlguot. 

A ttlffnoit matrix A typically hit a number of 
proptrtitt which Influonco tho choice of solution 
technique. For example, many phyilca! problomt give 
rlto to a eym metric positive definite matrix. For tho 
rtmtlnder of tho present discussion wt thill restrict 
ourttlvot to this class of flnlto olomont problems. This 
doss not moan that mors general esses cannot bo treated 
using concurrent techniques similar to these described 
below, but such problems are not within the scope of this 
srork. 

Traditionally, derivatives of Gaussian elimination 
have been used to solve these systems, giving robust 
performance under s wide range of matrix Ill-condltlonlng 
clrccmstancss. Matrix Ill-condltlonlng msy occur If a 
given system It very nearly singular, or exhibits a very 
large ratio between the largest and smallest eigenvalues, 
In which csss carttin matrix operations which are 
sensitive to round-otf error may be Inaccurate. While 
direct techniques such ts Gaussian summation are not 
particularly sensitive to moderate Ill-condltlonlng, they 
sntsll t largo amount of computational work. Whether 
dealing with Iterative or dlract techniques, the matrix 
problem nearly always raprasants the dominant 
computational cost of a finite element calculation. In the 
case of Gaussian elimination, this coat Increases rapidly 
with problem site, being proportional to \Jr , where 
is the total number of degrees of freedom; and b la the 
mean diagonal bandwidth of non-zero elements In A. 

In a typical assembled finite element matrix, the 
entries In a given column or row are non-zero only wtthtn 
a given distance from the diagonal. This la because a 
given node only contributes Interaction terms from those 
nodes with which it shares an element. Although Judicious 
nods numbering can minimize the average column height 
and thus tho moan diagonal bandwidth of tho matrix, 
problams with large numbers of nedas in mora than one 
spatial direction will unavoidably have very large 
bandwtdths. This means that, especially for grids In 
higher dimensions, the work necessary to solve the 
system will increase much faster than the number of 
nodes as wo attempt to solve larger and larger systems. 

Certain classes of structural analysts problems may 
have considerably smaller mean bandwtdths than the moat 
general three-dimensional problem, and for these, direct 
techniques remain a viable approach. Research Into 
concurrent methods for these techniques la already well 
sdvanced( 2.3.a ). The present authors have been moat 
Interested In continuum mechanics problems, for which 
direct methods are more limited In their utility. 

For the above reasons, we hove pursued the 
Implementation of an Increasingly popular Iterative 
technique, the method of conjugate gradients. Assuming 
that It can be made to converge, tho conjugate gradient 
method offers the potential for much Improved 
performance on large three-dimensional grids. This, 
combined with a rather straightforward concurrent 
generalization made this an attractive area to explore. In 
addition, tho symmetric positive-definite nature ef the 
matrices considered makes the method of conjugate 
gradients a good drat choice (see for example 
Jennings, J). In the discussion which follows, we arid 


present the basic concurrent algorithm developed for this 
application. 

Figure 2 shows schematically how a two-dimensional 
finite element grid might be spread over the processors 
of an 8-n.ide MIMO machine. Each subset of elements 
(delineated by tne heavy lines) la treated affectively as a 
separate smaller finite element problem within each 
processor. Adjoining subdomains need only exchange 
boundary Information with neighboring processors In 
order to complete calculations for each region 
concurrently. The next section dfacuaaes In more detail 
how this concurrency Is managed. 





Fig. 2 A possible distributism of a twe-dimaeaioeal 
grid oa aa S-procaaaor concurrent computer 

The Concurrent Decomposition 

The conjugate gradient method represents a 
technique for Iterattvaly searching the space of vectors I 
In such a way aa to minimize a function of tho residual 
errors. Briefly, the conjugate gradient procedure 
consists of the following algorithm. Initially, 

Hal.ptal.fc - AX<*I (2) 

Then, for the k* step, 

1. Ap te >) 

2. «*•'». gW.^pW 

3. r tt ‘“-f tt '-a t A»W (3) 

i. ^ ) ✓ (r»» - r< k » ) 

J. ptk.ll.ftt.ll.^ptkl 

6. k • k‘l; gala! (ctali ant a a ul tarn car gal) 

As may be seen by Inspecting (3) above, this 
algorithm Involves two basic kinds of operations. The 
first of these Is the vector dot product, for example 
rtkl.flkl ^a second basic operation la tha matrix-vector 
eroduct, Ap* 1 '. Both of those primitive operations can be 
done In parallel by decomposing the problem Into regions 
of tho physical domain apace. A given 'global* vector such 
aa r or p la spread out among tha processors of a 
concurrent ensemble, with concurrent operations being 
performed on the various 'pieces* of the vectors m each 
processor. The only need for Information from oidslle a 
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given processor occurs when nodal on tha boundary 
between thi 'Jurisdiction!* of two precision in 
computid. 

Figure 3 lllustntii ichimitlcilly tho protocol usid 
for handling mind dogrooi of fnodom botwotn 
processors In thi two-dimensional cm. Each 
proconor oboyi i convention whoraby It accumulates 
contributions to global victor qusntltlos ( I, r, P ) 
from noighborlng processor! along tho 'right* and 'down' 
adgaa of tho region. Contributions arising from dogma 
of fnodom along a procoaaor'a 'loft* and ‘up* adgaa are 
asnt to tho respective neighboring processor for 
accumulation. Tho ‘lowor right* and 'upper loft* corner 
degress of freedom are passed twice tn reaching their 
destination. 




While only one processor has responsibility for 
accumulating s given shared degree of freedom, each 
neighboring processor must be provided with s copy of the 
accumuleted result for subsequent calculations. Thus, 
the process of updating s quantity such is Ap**', which 
Involves adding contributions from different olemsnts In 
different processors, proceeds as follows: 

1. The *locsl* contributions sro calculated within 
each processor Independently. This means calculating 
the contributions to the given matrix or vector product 
while neglecting tho efloct of any neighboring processor. 

2. A right-to-left pass of edge contributions la 
done. Vector contributions along the left edge of a 
processor ( I. a. , those labeled SL, sL, and sX) are sent 
to the neighboring processor on the toft, at the some time 
that the corresponding right edge degrees of freedom 
(AR, sX, and all) ire received from tha right. Tho 
received contributions sro added to tho locally calculated 
contributions already resident in those storage 'slots*. 

3. This Is followed by a down-to-up pass. Ths 
contributions labeled Sli and sU are sent up, while the AO 
and sX slots rscslvs contributions to sccumulots. Note 
that the corner degrees of freedom labeled sX are 



transmitted and accumulated twice before reaching their 
final destination tn an aX slot located diagonally from 
tholr starting point. 

4. Finally, an up-to-down shift, followed by s 
laft-to— right shift which overwrites rather than sdda to 
ths current contents of tho boundary arrays serves to 
distribute the the final sum of oil contributions to all ths 
Invo'ved processors. This ends the communication cycle, 
end the processors t'isn return to concurrent Internal 
computations. 

Aside from this kind of calculation, In which s 
global vector la updated on tho basis of local (olemont) 
Information, there Is one other situation in which 
processor 'responsibility* for global degrees of freedom 
Is significant. Global scalar products are calculated by 
forming partial dot products within each processor, which 
are In turn forwarded to a single 'control process* (which 
In this esse resides In a separate external processor) 
which performs the summation and takes action on the 
result (e.g., terminating the iterative loop). In this 
csss, In order to avoid ‘double-counting* any entry, 
each processor calculates its partial dot product only 
using Its Internal and accumulated deqrees of freedom 

The scheme described above la applicable to ths 
hypercubs machine architecture specifically, and more 
generally to any M 1110 computer which supports the 
following communications operations: (I) global 

broadcast of data from a designated controlling process 
or processor to the concurrent array, (2) transmission 
of unique dots 'messeges* between errsy elements and the 
control process, and (3) element-to-elsment data 
transmission between lattice nearest neighbors. 

IMRUUIMTATION 


The example discussed hers was written in the C 
language, and cross-compiled on s VAX- 1 1/750 system 
for execution on the bOM microprocessor-based 32 node 
Caltech Hypsrcubs (Mirk II) machine. Listings and 
further information on this code may be obtained by 
contacting one of the authors. 

Thors are no basic algorithmic differences between 
the concurrent conjugate gradient algorithm discussed 
here and ns equivalent counterpart on a sequential 
machine. Thus tha concurrent program should, assuming 
Infinite precision of calculation, yield results Identical to 
ths sequential version. 

in actual practice however, the exact result 
obtained by the conjugate gradient method, and the 
number of nerstlons required for It to converge, ore 
rather senetttve to the finite precision of tha numerical 
calculations. The accumulation stops which occur during 
Intsrproceaior communications represent additional 
arithmetic operations of finite precision, which introduce 
asms additional round-off error not present in tha 
sequential equivalent. This causes the concurrent coda to 
produce slightly different (but numerically equivalent) 
results. 



The computation/communication structure of tha 
concurrent algorithm la quite regular, In that each 
processor Is simultaneously doing the same type of task 
as every other, and tha only load Imbalance or processor 
waiting la caused by giving processors responsibility for 
different numbers of elements or degrees of freedom. 
The problem of how to beat decompose an arbitrarily 
shaped finite element grid presents a problem In tha 
preparation of ths Input data, but this preprocessing 


•tap hn nothing to do directly with tho octuol operation 
of tho concurrent algortthm(&) . Highly Irregular end/or 
non-rectlllneer gride ere handled treneperently, Juet ea 
long ae ell shared boundary nodra era properly Identified 
(aa par Fig. 3) In the input data stream. 

Among tha apodal conaldaratlona to note for thla 
application are raatrictlona on avallabla memory. In tha 
praaant example, we have choaen to calculate and atore 
each element atiffneaa entry. In auch a caaa, It becomea 
poaaible to exheuat the relatively modeat memory 
available in the preaent generation of machlnea. Thla 
problem la particularly acute In moving from two- to 
throo-dlmenelonal problema. One approach to overcome 
thla limitation la not to atore, but to recalculate 
atlffneeeea at each iteration. In auch a caaa, tha extra 
arithmetic operetiona may be mlnlmlied by utilizing 
one-point Quadrature with meah atabllliatlon tochnlquea 

<Z>- 


amallor problema' axecution tlmaa. in thla 
extrapolation, we have aaaumed a atrictly linear relation 
between number of oquatlona and axecution time per 
Iteration. The alngle-procaaeor proportionality conatant 
la aaaumed here to ba 6.41 mtlllaecond / iteration per 
degree of freedom. The quoted experimental efftclendea 
are valid to the degree that thla acallng aaaumption holda. 


In a problam which la perfectly lead balanced, the 
concurrent efficiency will be governed by the ratio of time 
epent communicating to time apent calculating 
concurrantly. Slnca In a two-dtmenalonal array, the 
number of communicated edge valuee vartea aa n*' 2 , 
where o la the totel number of equettona per proceaaor, 
thla ratio ahould vary ea a' 1/2 and the calculation ahould 
obey an efficiency relation, 

« - 1 - Cn‘ ,,J . (4) 


Beaulta of Inlllil Teala 

Tha above deacrlbed finite elemant/conjugate 
gradient program waa teated on a aertea of 
two-dimenaionel plane atreln elaatoatatlca problema, 
Intended to a) verify the proper execution of the program 
and b) provide benchmark meaaurementa of the 
program'a concurrent efficiency aa a function of problem 
alia and number of proceaaora. 

The nark II hypercuba waa uaed in theae teata In 
configuration from one node (0-dlmenalonal cube) 
through the full 32 nodea (3-dlmenaionel hypercube). 
The actual problema solved were rectangular arrays of 
elements with boundary displacements Imposed to produce 
a simple shear field solution. Concurrent efficiency la a 
measure of how nearly a concurrent proceaaor with m 
processors approaches speeding e given calculation by a 
factor of m. Thus we obtain experimental efficiencies by 
dividing the time required to solve a given problem in e 
single processor by the concurrent time and tha number 
of processors. As long as the sequential and concurrent 
algorithms are truly equivalent, this efficiency d has an 
upper bound of unity. 
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In Table I, we summarize the results of theae teat 
runs on a variety of problem aliee and numbers of 
processors. Since the maximum number of elements we 
could accomodate within a single processor was on the 
order of 600, the efficiencies Hated for the larger 
32-node runs are baaed upon extrapolations of the 


where C la e conatant. A log- log plot auch aa Figure 4 
may be uaed to compere the above efficiencies with this 
theoretical prediction. 
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Fig. 4 Ceanirreet speedup efficiency ae • Evacuee of 
problem tat end aomber ef processors 


On thla plot, star aymbola Indicate the 32-node 
efficiencies at varloua problem sizes. The results seem 
to agree Mil with the predicted s/opt • Ui line. The 
points representing those cases with fewer then 16 
processors plot above the 32-node trend because they do 
not Involve proceaaora communicating on all four 
boundaries. In summary, the concurrent conjugate 
gradient algorithm exhibits performance characteristics 
Mich ere theoretically understood, and well suited for 
application to large ensembles ef processors. 

It le Interesting te note that the experimental value 
for the constant C In (4) la apparently approximately l. 
This result, which agrees with the results of fleler(A) for 
a two-dimensional flnlts difference scheme on the same 
machine, la a reflection of the fact that the times for 
calculation end communication of a single degree of 
freedom on thla hardware are roughly equal. 
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CONCLUSIONS 


KXmiNCIS 


We find that the concurrent conjugate gradient 
algorithm outlined hara maata aur expectations In 
providing largo, naarly llnaar apaadup In flnlta alamant 
lyatam aolutlana. Results from actual runs an (ha 
32-nede nark II hypareuba ayatam ylald not afflcianclaa 
upward of 901. From thla wa can conclude that future 
major uaa of thla and rotated algorithm! an largo flnlta 
alamant eppllcatlena will hinge upon the applicability of 
iterative techniquea, and not upon any laaua of 
concurrency or efficiency. 

Several araaa are now Indicated for future reeearch 
in thla area. Among the meat Immediate naada are: 

( l ) effective preconditioned te apeed convergence of the 
CC algorithm, (2) extension to throe dtmenalena, (3) 
inveatlgatlon of mathada to reduce or eliminate atlffneaa 
atorage rogulrementa, and parhape moat Importantly, 
(4) automated pracaduraa far breaking up and balancing 
an arbitrary flnlta alamant problem among praceaaera. 
Wa anticipate addreealng thaaa prablama In the near 
future. 
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